- Vojtěch Kulíšek
- Lukáš Plevač
- Pavel Šesták
Z dostupných datových sad si zvolte jednu datovou sadu, kterou se budete dále zabývat. Stáhněte si zvolenou datovou sadu z uvedeného zdroje a prostudujte si dostupné informace k této datové sadě. Proveďte explorativní analýzu zvolené datové sady. Pro každý následující bod implementujte odpovídající sekci ve zdrojovém kódu a zjištěné výsledky popište v dokumentaci: prozkoumejte jednotlivé atributy datové sady, jejich typ a hodnoty, kterých nabývají (počet hodnot, nejčastější hodnoty, rozsah hodnot atd.) prozkoumejte rozložení hodnot jednotlivých atributů pomocí vhodných grafů, zaměřte se i na to, jak hodnota jednoho či dvou atributů ovlivní rozložení hodnot jiného atributu. Do dokumentace vložte alespoň 5 různých grafů, zobrazujících zjištěná rozložení hodnot. Použijte různé typy grafů (např. bodový graf, histogram, krabicový nebo houslový graf, graf složený z více podgrafů apod.). zjistěte, zda zvolená datová sada obsahuje nějaké odlehlé hodnoty. proveďte podrobnou analýzu chybějící hodnot (celkový počet chybějících hodnot, počet objektů s více chybějícími hodnotami atd.). proveďte korelační analýzu numerických atributů (k analýze využijte i grafy a korelační koeficienty). Připravte 2 varianty datové sady vhodné pro dolovací algoritmy. Můžete uvažovat dolovací úlohu uvedenou u datové sady nebo navrhnout vlastní dolovací úlohy. V případě vlastní dolovací úlohy ji specifikujte v dokumentaci. V rámci přípravy datové sady proveďte následující kroky: Odstraňte z datové sady atributy, které jsou pro danou dolovací úlohu irelevantní. Vypořádejte se s chybějícími hodnotami. Pro odstranění těchto hodnot využijte alespoň dvě různé metody pro odstranění chybějících hodnot. Vypořádejte se s odlehlými hodnotami, jsou-li v datové sadě přítomny. Pro jednu variantu datové sady proveďte diskretizaci numerických atributů tak, aby výsledná datová sada byla vhodná pro algoritmy, které vyžadují na vstupu kategorické atributy. Pro druhou variantu datové sady proveďte vhodnou transformaci kategorických atributů na numerické atributy. Dále pak proveďte normalizaci numerických atributů, které má smysl normalizovat. Výsledná datová sada by měla být vhodná pro metody vyžadující numerické vstupy.
import subprocess
import sys
import os
requirementsPath = os.path.join(os.path.dirname(os.path.realpath('__file__')),"requirements.txt")
subprocess.check_call([sys.executable, "-m", "pip", "install", "-r", requirementsPath])
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import pandas as pd
from sklearn.linear_model import LinearRegression
import re
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
PLOT_GRAPHS = True
PLOT_STATS = True
Defaulting to user installation because normal site-packages is not writeable Requirement already satisfied: numpy in /usr/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 1)) (1.23.4) Requirement already satisfied: matplotlib in /usr/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (3.6.2) Requirement already satisfied: pandas in /home/vojta/.local/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 3)) (1.5.1) Requirement already satisfied: seaborn in /home/vojta/.local/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 4)) (0.12.1) Requirement already satisfied: sklearn in /home/vojta/.local/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 5)) (0.0) Requirement already satisfied: scikit-learn in /home/vojta/.local/lib/python3.10/site-packages (from -r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 6)) (1.1.3) Requirement already satisfied: contourpy>=1.0.1 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (1.0.6) Requirement already satisfied: cycler>=0.10 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (0.11.0) Requirement already satisfied: fonttools>=4.22.0 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (4.38.0) Requirement already satisfied: kiwisolver>=1.0.1 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (1.4.4) Requirement already satisfied: packaging>=20.0 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (21.3) Requirement already satisfied: pillow>=6.2.0 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (9.3.0) Requirement already satisfied: pyparsing>=2.2.1 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (3.0.9) Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3.10/site-packages (from matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (2.8.2) Requirement already satisfied: pytz>=2020.1 in /home/vojta/.local/lib/python3.10/site-packages (from pandas->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 3)) (2022.5) Requirement already satisfied: joblib>=1.0.0 in /home/vojta/.local/lib/python3.10/site-packages (from scikit-learn->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 6)) (1.2.0) Requirement already satisfied: threadpoolctl>=2.0.0 in /home/vojta/.local/lib/python3.10/site-packages (from scikit-learn->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 6)) (3.1.0) Requirement already satisfied: scipy>=1.3.2 in /usr/lib/python3.10/site-packages (from scikit-learn->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 6)) (1.9.3) Requirement already satisfied: six>=1.5 in /usr/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->-r /home/vojta/projekty/but-fit-upa/project2/requirements.txt (line 2)) (1.16.0)
V této části se blíže seznámíme s daty, které máme dále upravovat. V rámcí modelu CRISP-DM jsme v sekci pochopení dat. Pro tuto úlohu jsme si zvolili datovou sadu z průzkumu platů v IT sektoru z roků 2018 až 2020. Dále budeme pracovat pouze s nejnovějšími daty z roku 2020 jelikož ekonomická situace je v dnešní době velmi dynamická a už tak se jedná o stará data. Starší data můžeme dále použít pro validaci našich klasifikátorů a porovnat jak moc se datové sady vzájemně liší.
V této sekci si nahrajeme zvolený datový soubor do operační paměti pomocí knihovny pandas. Z datové odstraníme atributy, které zjevně nepocházejí od uživatele jako je například časová značka. Jelikož jsou to data z dotazníku, tak některé otázky jsou rozsáhle popsány, aby uživatel věděl co přesně má vyplnit, pro naše účely si tyto sloupce přejmenujeme, aby se s daty dále lépe pracovalo.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
FILES = ["data/IT_Salary_Survey_EU_2018.csv", "data/IT_Salary_Survey_EU_2019.csv", "data/IT_Salary_Survey EU_2020.csv"]
data = pd.read_csv(FILES[2])
if "Timestamp" in data:
data.drop(["Timestamp"], axis=1, inplace=True)
if "Zeitstempel" in data:
data.drop(["Zeitstempel"], axis=1, inplace=True)
if "0" in data:
data.drop(["0"], axis=1, inplace=True)
data.rename(columns = {
"Your main technology / programming language":'Main Technology',
"Other technologies/programming languages you use often" : "Other technologies",
"Yearly brutto salary (without bonus and stocks) in EUR": "Yearly brutto",
"Annual bonus+stocks one year ago. Only answer if staying in same country" : "Bonus and stocks in same country",
"Have you lost your job due to the coronavirus outbreak?" : "Job lost due covid",
"Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR" : "Home office compensation",
"Position " : "Position",
}, inplace = True)
data.sample(10)
| Age | Gender | City | Position | Total years of experience | Years of experience in Germany | Seniority level | Main Technology | Other technologies | Yearly brutto | Yearly bonus + stocks in EUR | Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country | Bonus and stocks in same country | Number of vacation days | Employment status | Сontract duration | Main language at work | Company size | Company type | Job lost due covid | Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week | Home office compensation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 261 | 38.0 | Male | Berlin | Backend Developer | 11 | 1 | Senior | Javascript | Python, Javascript / Typescript, SQL, AWS, Docker | 35000.0 | NaN | NaN | NaN | NaN | Founder | Unlimited contract | English | up to 10 | Startup | No | NaN | NaN |
| 301 | 39.0 | Male | Munich | DBA | 17 | 5 | Senior | Oracle | SQL | 64000.0 | 70000 | NaN | NaN | 28 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | NaN | 500 |
| 446 | 34.0 | Female | Berlin | QA Engineer | 6 | 2 | Senior | Ruby | Ruby, Java / Scala, AWS, Kubernetes, Docker | 67500.0 | 0 | 61400.0 | 0 | 30 | Full-time employee | Unlimited contract | English | 1000+ | Product | No | NaN | NaN |
| 424 | 42.0 | Male | Den Haag | Backend Developer | 20 | 6 | Senior | .NET | Python, Javascript / Typescript, SQL, AWS, Docker | 75000.0 | 0 | 75000.0 | 0 | 25 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 40.0 | 300 |
| 804 | 20.0 | Male | Berlin | Mobile Developer | 5 | 2 | Lead | C# | .NET | 240000.0 | 15000+- | NaN | NaN | 27 | Full-time employee | Unlimited contract | English | 1000+ | Product | Yes | 0.0 | 0 |
| 702 | 37.0 | Male | Stuttgart | Backend Developer | NaN | NaN | Senior | C# | NaN | 70000.0 | NaN | NaN | NaN | 30 | Full-time employee | Unlimited contract | NaN | NaN | NaN | No | NaN | NaN |
| 113 | NaN | Male | Berlin | QA Engineer | 6 | 6 | Middle | Javascript | Javascript / Typescript, AWS | 50000.0 | NaN | 45000.0 | NaN | 30 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 0.0 | NaN |
| 328 | 29.0 | Male | Bonn | Mobile Developer | 5 | 2 | Middle | Objective-C | Swift | 55000.0 | 4000 | 55000.0 | 1800 | 24 | Full-time employee | Unlimited contract | English | 11-50 | Product | No | NaN | NaN |
| 821 | 35.0 | Male | Berlin | Software Engineer | 15 | 4 | Senior | Java | Python, SQL, AWS, Docker | 85000.0 | 0 | 85000.0 | 0 | 25 | Full-time employee | Unlimited contract | English | 11-50 | Product | No | NaN | NaN |
| 9 | 35.0 | Male | Berlin | Software Engineer | 15 | 3 | Lead | Java | NaN | 95000.0 | NaN | 90000.0 | NaN | 30 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 0.0 | NaN |
#pd.plotting.parallel_coordinates(data, "Your level")
#plt.show()
#note1 i found whole line with NaN filter it
def plot_graphs(data: pd.DataFrame) -> None:
"""
Get pandas dataframe. Describe and plot graphs for all columns in dataframe.
PRE CONDITION: If you want just text info about params specify PLOT_STATS. If you want pyplot graphs as output define constant PLOT_GRAPHS to True.
"""
if not PLOT_STATS and not PLOT_GRAPHS:
return
columns = data.columns
dtypes = data.dtypes
for column in columns:
print(data[column].describe())
if not PLOT_GRAPHS:
continue
figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5))
figure.suptitle("Data for: "+ column, fontsize=15)
axes[0].set_title("NaN values", fontsize=12)
IsNan = data[column].isna().sum()
IsNotNan = len(data[column])-IsNan
axes[0].bar("Unfilled", IsNan)
axes[0].bar("Filled", IsNotNan)
axes[1].set_title("Data distribution", fontsize=12)
if dtypes[column] == "object":
data[column].value_counts().plot(kind='bar')
elif dtypes[column] == "float64":
data[column].plot(kind='box')
plt.show()
plot_graphs(data)
count 1226.000000 mean 32.509788 std 5.663804 min 20.000000 25% 29.000000 50% 32.000000 75% 35.000000 max 69.000000 Name: Age, dtype: float64
count 1243 unique 3 top Male freq 1049 Name: Gender, dtype: object
count 1253 unique 119 top Berlin freq 681 Name: City, dtype: object
count 1247 unique 148 top Software Engineer freq 387 Name: Position, dtype: object
count 1237 unique 48 top 10 freq 138 Name: Total years of experience, dtype: object
count 1221 unique 53 top 2 freq 195 Name: Years of experience in Germany, dtype: object
count 1241 unique 24 top Senior freq 565 Name: Seniority level, dtype: object
count 1126 unique 256 top Java freq 184 Name: Main Technology, dtype: object
count 1096 unique 562 top Javascript / Typescript freq 44 Name: Other technologies, dtype: object
count 1.253000e+03 mean 8.027904e+07 std 2.825061e+09 min 1.000100e+04 25% 5.880000e+04 50% 7.000000e+04 75% 8.000000e+04 max 1.000000e+11 Name: Yearly brutto, dtype: float64
count 829 unique 168 top 0 freq 227 Name: Yearly bonus + stocks in EUR, dtype: object
count 8.850000e+02 mean 6.322459e+05 std 1.680508e+07 min 1.100000e+04 25% 5.500000e+04 50% 6.500000e+04 75% 7.500000e+04 max 5.000000e+08 Name: Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country, dtype: float64
count 614 unique 131 top 0 freq 200 Name: Bonus and stocks in same country, dtype: object
count 1185 unique 45 top 30 freq 488 Name: Number of vacation days, dtype: object
count 1236 unique 11 top Full-time employee freq 1190 Name: Employment status, dtype: object
count 1224 unique 3 top Unlimited contract freq 1159 Name: Сontract duration, dtype: object
count 1237 unique 14 top English freq 1020 Name: Main language at work, dtype: object
count 1235 unique 5 top 1000+ freq 448 Name: Company size, dtype: object
count 1228 unique 63 top Product freq 760 Name: Company type, dtype: object
count 1233 unique 10 top No freq 1162 Name: Job lost due covid, dtype: object
count 373.000000 mean 12.967828 std 15.275174 min 0.000000 25% 0.000000 50% 0.000000 75% 30.000000 max 40.000000 Name: Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week, dtype: float64
count 462 unique 59 top 0 freq 161 Name: Home office compensation, dtype: object
Z vygenerovaných grafů je vidět, že většina dat je vyplněná. V případě, že chybí v jistém sloupci mnoho dat tak se jedná většinou o věci jako bonusy, různé kompenzace a podobně. U těchto atributů se dá předpokládat, že když například uživatel nedostal kompenzaci za práci z domu tak položku nevyplnil. Pro další práci s datovou sadou si podrobně projdeme datové sloupce a zjistíme jaké konkrétní chyby se vyskytují v datové sadě, aby jsme se na ně mohli soustředit v další části čištění dat.
Většina hodnot je vyplněná. Dokonce jsou i v rozumném rozpětí, kde většina respondentů je ve věku 30 až 35 let. Jedná se o numerický atribut.
Pohlaví také většina respondentů vyplnila a je rozděleno do tří kategorií, zde nebude nutné nějaké významné čištění, jen doplníme chybějící hodnoty nejčastější hodnotou. Jelikož se jedná o datovou sadu z technického oboru, tak není překvapující, že rozložení pohlaví je výrazně nevyvážené.
Tento kategorický atribut bude potřeba pro další práci nějak shluknout. Podle hodnot je zřejmé, že sběr dat pochází z Německa.
Tento kategorický atribut bude také pro další práci potřeba vyčistit, konkrétně odfiltrovat hodnoty s nízkou četností.
Tento zřejmě numerický atribut byl knihovnou pandas interpretován jako kategorický z důvodu pár textových odpovědí. Pro naši další práci bude potřeba převést na numerický a textové odpovědi vyfiltrovat.
Stejný případ jako u Total years of experience.
Kategorický atribut, který se zjevně rozpadá do čtyř kategorií a dále mnoho odlehlých hodnot, které můžeme nejspíše shluknout.
Pro nás se jedná o velmi významný sloupec, bohužel je velmi nešťastně zadaný. Pro naši další práci rozdělíme řetězce na jednotlivé technologie a zobrazíme si jejich histogram, z kterého vyčteme nejvíce používané technologie.
Obdobný problém jako u main technology jen zde se uživatelé rozepsali ještě více a zde se jedná v této formě již o úplně nepoužitelný atribut
Numerický atribut, z kterého zatím bohužel nic nevyčteme jelikož obsahuje dost výraznou anomálii, které se před další prací budeme muset zbavit.
Zjevně numerický atribut, který je interpretován jako kategorický. Bude nutné převést na numerický. Mnoho nezadaných hodnot, ale z podstaty věci se dá předpokládat, že když někdo nevyplní bonus tak žádný nemá.
Roční plat před jedním rokem, opět graf zatížen odlehlou hodnotou, kterou bude potřeba vyfiltrovat.
Obdobně jako u Yearly bonus + stocks in EUR.
Bude nutné převést na numerický atribut, jelikož naše datová sada obsahuje většinu lidí co jsou zaměstnanci tak chybějící hodnoty budeme doplňovat střední hodnotou.
Většina respondentů jsou zaměstnanci na plný úvazek. Můžeme pouze očistit o málo frekventované kategorie, zde jich ale naštěstí není tolik.
Většina smluv je na dobu neurčitou. Vydíme jednu odlehlou hodnotu 0, které by bylo vhodné se zbavit a pár chybějících hodnot, které si můžeme v tomhle sloupci dovolit nahradit nejčetnější hodnotou, jelikož jasně převládá a chybějících hodnot je málo.
Kategorický atribut, kde převládají dvě hodnoty. Zbytek je možno rozdělit separátorem a případně oddělat nesmyslné hodnoty.
Nezašumělý kategorický atribut, kde pár chybějících hodnot můžeme nahradit nejčetnější hodnotou.
Mnoho hodnot v tomto kategorickém atributu má malý výskyt. Dominují zde tři kategorie a zbytek bude vhodné shluknout.
Zjevně binární atribut, který obsahuje mnoho slovních odpovědí, bude nutné očistit.
Mnoho chybějících hodnot, šlo by z tohoto vyčíst průměrnou dobu v práci, proto chybějící hodnoty nahradíme 40ti hodinami jako standardní pracovní týden, což je asi výchozí hodnota a člověk který nemá zkrácený pracovní úvazek tuto hodnotu zřejmě ignoroval.
Další zjevně numerický atribut, který bude potřeba očistit od slovních hodnot.
korelační analýza slouží pro hledání podobností mezi atributy. Když najdeme silnější korelaci mezi atributy, tak můžeme využít regresi pro dopočítání chybějících hodnot mezi takto korelovanými atributy a daný odhad bude daleko přesnější než například medián ze souboru. Korelační analýza pracuje nad numerickými atributy. Jak jsme si ukázali při analýze datových atributů tak mnoho numerických je interpretováno jako kategorické atributy z důvodu například nějaké textové odpovědi. Takto detekované atributy si pomocí knihovny pandas převedeme na numerické, kde špatné hodnoty převedeme na NaN.
#
data['Total years of experience'] = pd.to_numeric(data['Total years of experience'], errors="coerce")
data['Age'] = pd.to_numeric(data['Age'], errors="coerce")
data['Years of experience in Germany'] = pd.to_numeric(data['Years of experience in Germany'], errors="coerce")
data['Yearly bonus + stocks in EUR'] = pd.to_numeric(data['Yearly bonus + stocks in EUR'], errors="coerce")
data['Bonus and stocks in same country'] = pd.to_numeric(data['Bonus and stocks in same country'], errors="coerce")
data['Number of vacation days'] = pd.to_numeric(data['Number of vacation days'], errors="coerce")
data['Home office compensation'] = pd.to_numeric(data['Home office compensation'], errors="coerce")
data.sample(5)
| Age | Gender | City | Position | Total years of experience | Years of experience in Germany | Seniority level | Main Technology | Other technologies | Yearly brutto | Yearly bonus + stocks in EUR | Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country | Bonus and stocks in same country | Number of vacation days | Employment status | Сontract duration | Main language at work | Company size | Company type | Job lost due covid | Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week | Home office compensation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58 | 29.0 | Male | Berlin | Software Engineer | 5.0 | 3.0 | Senior | Go | Java / Scala, AWS, Docker | 75000.0 | NaN | 58000.0 | NaN | 30.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | NaN | NaN |
| 877 | 33.0 | Male | Munich | Backend Developer | 12.0 | 0.5 | Middle | PHP | PHP, Javascript / Typescript, SQL, Docker | 56000.0 | 0.0 | NaN | NaN | 28.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | NaN | NaN |
| 1056 | 30.0 | Male | Berlin | Software Engineer | 10.0 | 2.0 | Middle | iOS | NaN | 71000.0 | NaN | 65000.0 | NaN | 27.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 0.0 | NaN |
| 187 | 46.0 | Male | Munich | Scrum Master | 12.0 | 8.0 | Senior | Agile | C/C++, PHP, .NET, SQL | 85000.0 | 10000.0 | 88000.0 | 10000.0 | 30.0 | Full-time employee | Unlimited contract | German | 1000+ | Product | No | 38.0 | NaN |
| 49 | 31.0 | Male | Berlin | Software Engineer | 7.0 | 3.0 | Lead | Python | Python, SQL, Google Cloud, Kubernetes, Docker | 76000.0 | 10000.0 | 69000.0 | NaN | 27.0 | Full-time employee | Unlimited contract | English | 101-1000 | Startup | No | NaN | NaN |
cov_matrix = data.corr('spearman')#pd.DataFrame.corr(data)
#print(cov_matrix)
sn.heatmap(cov_matrix, annot=True)
plt.show()
Z heat mapy můžeme vidět závislosti mezi věkem a odpracovanými roky což je pochopitelné a tuto závislost tedy budeme moci využít. Další silná závislost je mezi sloupci ohledně investic, které takto dopočítávat ale nebudeme.
Máme za sebou explorativní analýzu, v které jsme identifikovali jisté nedostatky v datové sadě. Nyní se pokusíme data očistit a doplnit.
Zde odstraníme neužitečné atributy a zahodíme záznamy, které obsahují míň jak šest vyplněných hodnot, jelikož takové záznamy mají malou přídanou hodnotu pro datovou sadu a mnoho dat by bylo bráno jako průměr či nějak korelováno.
data2 = data
if "Timestamp" in data.columns:
data2 = data.drop('Timestamp', axis=1)
if 'Are you getting any Stock Options?' in data2.columns:
data2 = data2.drop('Are you getting any Stock Options?', axis=1)
print("Records: ",len(data2))
data2 = data2.dropna(thresh=6)
print("Records: ",len(data2))
Records: 1253 Records: 1249
V této části se pokusíme doplnit do datové sady chybějící hodnoty pomocí různých přístupů. Kde půjde odhadnout hodnotu na základě vysoké korelace s jiným atributem tak využijeme regresi. U sloupců, kde to dává smysl tak použijeme střední hodnotu. Jsou sloupce kde můžeme očekávat, že není vyplněno jelikož se to daného respondenta netýkalo (například chybějící položka bonusy ve firmě, tak zřejmě žádné bonusy nemá).
Zde vytvoříme prediktory pro korelované atributy. Jako trénovací data vezmeme všechny záznamy bez chybějících hodnot.
#convert object types to float, errors coerce specified if its not a numeric type replace with NaN
data2['Total years of experience'] = pd.to_numeric(data2['Total years of experience'], errors="coerce")
data2['Age'] = pd.to_numeric(data2['Age'], errors="coerce")
dataLearn = data2.dropna()
AgePredict = LinearRegression()
AgePredict = AgePredict.fit(dataLearn[['Total years of experience']].values, dataLearn[['Age']].values)
YearOfExperiencePredict = LinearRegression()
YearOfExperiencePredict = YearOfExperiencePredict.fit(dataLearn[['Age']].values, dataLearn[['Total years of experience']].values)
YearOfExperienceInGermanyPredict = LinearRegression()
YearOfExperienceInGermanyPredict = YearOfExperienceInGermanyPredict.fit(dataLearn[['Total years of experience']].values, dataLearn[['Years of experience in Germany']].values)
YearScore = YearOfExperiencePredict.score(dataLearn[['Age']].values, dataLearn[['Total years of experience']].values)
print(YearScore)
x = np.linspace(0, 30, 30)
plt.plot(dataLearn['Total years of experience'], dataLearn['Age'], 'o')
AgeScore = AgePredict.score(dataLearn[['Total years of experience']].values, dataLearn[['Age']].values)
print(AgeScore)
diabetes_y_pred = AgePredict.predict(np.array([x]).T)
plt.plot(x, diabetes_y_pred, color="red", linewidth=3)
0.5715946738931867 0.571594673893187
[<matplotlib.lines.Line2D at 0x7f13f717ad70>]
salaryColumnName = ""
if "Current Salary" in dataLearn:
salaryColumnName = "Current Salary"
elif "Yearly brutto" in dataLearn:
salaryColumnName = "Yearly brutto"
AgePredictDependsOnSalary = LinearRegression()
AgePredictDependsOnSalary = AgePredictDependsOnSalary.fit(dataLearn[[salaryColumnName]].values, dataLearn[['Age']].values)
x = np.linspace(0, 175000, 175000)
plt.plot(dataLearn[salaryColumnName], dataLearn['Age'], 'o')
AgePredictDependsOnSalary.score(dataLearn[[salaryColumnName]].values, dataLearn[['Age']].values)
diabetes_y_pred = AgePredictDependsOnSalary.predict(np.array([x]).T)
plt.plot(x, diabetes_y_pred, color="red", linewidth=3)
[<matplotlib.lines.Line2D at 0x7f13f5edf3a0>]
Tento prediktor se snažil naučit vztah mezi věkem a platem. Jak vidíme tak zde nějak silný vztah neexistuje. Použití takového prediktoru by vedlo k chybám.
data2.loc[data2['Yearly bonus + stocks in EUR'].isna(), 'Yearly bonus + stocks in EUR'] = 0
data2.loc[data2['Bonus and stocks in same country'].isna(), 'Bonus and stocks in same country'] = 0
data2.loc[data2['Number of vacation days'].isna(), 'Number of vacation days'] = data2['Number of vacation days'].mode()[0]
data2.loc[data2['Job lost due covid'].isna(), 'Job lost due covid'] = "No"
data2.loc[data2['Home office compensation'].isna(), 'Home office compensation'] = 0
data2.loc[data2['Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week'].isna(), 'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week'] = 40 #standard hours per week
data2.loc[data2['Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week'] == 0, 'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week'] = 40 #standard hours per week
data2.loc[data2['Gender'].isna(), 'Gender'] = data2['Gender'].mode()[0]
data2.loc[data2['City'].isna(), 'City'] = data2['City'].mode()[0]
data2.loc[data2['Seniority level'].isna(), 'Seniority level'] = data2['Seniority level'].mode()[0]
data2.loc[data2['Main language at work'].isna(), 'Main language at work'] = data2['Main language at work'].mode()[0]
data2.loc[data2['Company size'].isna(), 'Company size'] = data2['Company size'].mode()[0]
data2.loc[data2['Company type'].isna(), 'Company type'] = data2['Company type'].mode()[0]
data2.loc[data2['Employment status'].isna(), 'Employment status'] = data2['Employment status'].mode()[0]
data2.loc[data2['Сontract duration'].isna(), 'Сontract duration'] = data2['Сontract duration'].mode()[0]
data2.drop(data2.loc[data2["Main Technology"].isna()].index, inplace=True)
data2.drop(data2.loc[data2["Other technologies"].isna()].index, inplace=True)
data2.loc[data2['Age'].isna() & data2['Total years of experience'].notna(), 'Age'] = \
data2.loc[data2['Age'].isna() & data2['Total years of experience'].notna(), 'Total years of experience'].apply(lambda exp : np.round(AgePredict.predict([[exp]]))[0][0])
data2 = data2.drop(data2.loc[data2['Age'].isna() & data2['Total years of experience'].isna()].index)
data2.loc[data2['Years of experience in Germany'].isna() & data2['Total years of experience'].notna(), 'Years of experience in Germany'] = \
data2.loc[data2['Years of experience in Germany'].isna() & data2['Total years of experience'].notna(), 'Total years of experience'].apply(lambda exp : np.round(YearOfExperienceInGermanyPredict.predict([[exp]]))[0][0])
data2 = data2.drop(data2.loc[data2['Years of experience in Germany'].isna() & data2['Total years of experience'].isna()].index)
Zde se snažíme snížit počet různých kategorií pro pozici
originalLenght = len(data2['Position'].unique())
for i in data2.index:
if not pd.isnull(data2['Position'][i]):
data2.loc[i, 'Position'] = re.sub("(\s*senior\s*|\s*junior\s*|\s*middle\s*)", " ", data2['Position'][i].lower())
data2.loc[i, 'Position'] = re.sub("(^\s*|\s*$)", " ", data2['Position'][i])
data2.loc[i, 'Position'] = re.sub("\s+", " ", data2['Position'][i])
data2.drop(data2.loc[data2['Position'].isna()].index, inplace=True)
NewLenght = len(data2['Position'].unique())
print("Reduced", originalLenght-NewLenght)
Reduced 10
data2.loc[data2['Total years of experience'].isna() & data2['Age'].notna(), 'Total years of experience'] = \
data2.loc[data2['Total years of experience'].isna() & data2['Age'].notna(), 'Age'].apply(lambda age : np.round(YearOfExperiencePredict.predict([[age]]))[0][0])
data2.drop(data2.loc[data2['Total years of experience'].isna() & data2['Age'].isna()].index, inplace=True)
Zde nejspíše člověk neuvedl druhý plat, jelikož se mu plat nezměnil a přišlo mu to zbytečné. Můžeme křížově doplnit.
originalLenght = len(data2)
currentSalary = "Yearly brutto"
salaryOneYearAgo = "Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country"
data2.loc[data2[currentSalary].isna() & data2[salaryOneYearAgo].notna(), currentSalary] = data2.loc[data2[currentSalary].isna() & data2[salaryOneYearAgo].notna(), salaryOneYearAgo]
data2.loc[data2[salaryOneYearAgo].isna() & data2[currentSalary].notna(), salaryOneYearAgo] = data2.loc[data2[salaryOneYearAgo].isna() & data2[currentSalary].notna(), currentSalary]
data2.drop(data2.loc[data2[currentSalary].isna() & data2[salaryOneYearAgo].isna()].index, inplace=True)
"""
salaries = {
'Current Salary': ['Salary one year ago', 'Salary two years ago'],
'Salary one year ago': ['Salary two years ago', 'Current Salary'],
'Salary two years ago': ['Salary one year ago', 'Current Salary']
}
for i in data2.index:
for j in salaries.keys():
if pd.isnull(data2[j][i]):
for salary in salaries[j]:
if not pd.isnull(data2[salary][i]):
data2.loc[i, j] = data2[salary][i]
break
if pd.isnull(data2['Current Salary'][i]):
data2.drop(i, inplace=True)
"""
NewLenght = len(data2)
print("Reduced", originalLenght-NewLenght)
Reduced 0
data2.Position.str.split(expand=True).stack().value_counts()
'''
print(x.keys()[0], x[0])
dataLearn = data2.dropna()
AgePredict = LinearRegression()
AgePredict = AgePredict.fit(x.keys()[0], dataLearn[['Age']].values)
x = np.linspace(0, 30, 30)
plt.plot(dataLearn['Position'], dataLearn['Age'], 'o')
AgeScore = AgePredict.score(x.keys()[0], dataLearn[['Age']].values)
print(AgeScore)
diabetes_y_pred = AgePredict.predict(np.array([x]).T)
plt.plot(x, diabetes_y_pred, color="red", linewidth=3)
'''
'\nprint(x.keys()[0], x[0])\n\ndataLearn = data2.dropna()\n\nAgePredict = LinearRegression()\nAgePredict = AgePredict.fit(x.keys()[0], dataLearn[[\'Age\']].values)\n\nx = np.linspace(0, 30, 30)\nplt.plot(dataLearn[\'Position\'], dataLearn[\'Age\'], \'o\')\n\nAgeScore = AgePredict.score(x.keys()[0], dataLearn[[\'Age\']].values)\nprint(AgeScore)\n\ndiabetes_y_pred = AgePredict.predict(np.array([x]).T)\nplt.plot(x, diabetes_y_pred, color="red", linewidth=3)\n'
V explorativní analýze jsme narazili na některé nečitelné numerické grafy, jelikož měřítko na ose výrazně ovlivnila odlehlá hodnota. Nyní se takovýchto hodnot pokusíme zbavit pomocí metody zscore. ta předpokládá normální rozdělení. V našem okolí se většina jevů chová podle normálního rozdělení, což nám blíže říká centrální limitní věta. V normálním rozdělení 99.7% hodnot leží v intervalu <-3 σ; 3 σ> a oproto hodnoty mimo tento interval odstraníme.
originalLenght = len(data2)
#print("dtypes: ", data.dtypes)
print("columns: ", data2.select_dtypes(include=["number"]).columns)
def drop_numerical_outliers(df, z_thresh=3):
# Constrains will contain `True` or `False` depending on if it is a value below the threshold.
constrains = df.select_dtypes(include=["number"]) \
.apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
.all(axis=1)
# Drop (inplace) values set to be rejected
df.drop(df.index[~constrains], inplace=True)
drop_numerical_outliers(data2)
NewLenght = len(data2)
print("Reduced: ", originalLenght-NewLenght)
data2.head()
columns: Index(['Age', 'Total years of experience', 'Years of experience in Germany',
'Yearly brutto', 'Yearly bonus + stocks in EUR',
'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
'Bonus and stocks in same country', 'Number of vacation days',
'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
'Home office compensation'],
dtype='object')
Reduced: 135
| Age | Gender | City | Position | Total years of experience | Years of experience in Germany | Seniority level | Main Technology | Other technologies | Yearly brutto | Yearly bonus + stocks in EUR | Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country | Bonus and stocks in same country | Number of vacation days | Employment status | Сontract duration | Main language at work | Company size | Company type | Job lost due covid | Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week | Home office compensation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26.0 | Male | Munich | software engineer | 5.0 | 3.0 | Senior | TypeScript | Kotlin, Javascript / Typescript | 80000.0 | 5000.0 | 75000.0 | 10000.0 | 30.0 | Full-time employee | Unlimited contract | English | 51-100 | Product | No | 40.0 | 0.0 |
| 4 | 37.0 | Male | Berlin | backend developer | 17.0 | 6.0 | Senior | C# .NET | .NET, SQL, AWS, Docker | 62000.0 | 0.0 | 62000.0 | 0.0 | 29.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 40.0 | 0.0 |
| 5 | 32.0 | Male | Berlin | devops | 5.0 | 1.0 | Senior | AWS, GCP, Python,K8s | Python, AWS, Google Cloud, Kubernetes, Docker | 76000.0 | 5000.0 | 76000.0 | 5000.0 | 30.0 | Full-time employee | Unlimited contract | English | 11-50 | Startup | No | 40.0 | 0.0 |
| 7 | 24.0 | Male | Berlin | frontend developer | 5.0 | 1.0 | Senior | Typescript | Javascript / Typescript | 65000.0 | 0.0 | 65000.0 | 0.0 | 27.0 | Full-time employee | Unlimited contract | English | 1000+ | Product | No | 40.0 | 600.0 |
| 8 | 29.0 | Male | Berlin | backend developer | 8.0 | 2.0 | Senior | PHP | SQL, AWS, Docker | 56000.0 | 0.0 | 55000.0 | 0.0 | 28.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 30.0 | 0.0 |
V explorativní analýze jsme detekovali, že mnoho lidí zadalo více technologií. Nyní se pokusíme hodnoty rozdělit a vykreslit graf četností použití technologií.
#main technologies
mainTechs = data2["Main Technology"].str.split(pat=r"[,\/\s]+",expand=True).stack().value_counts()
outliers = mainTechs.loc[mainTechs <= 1]
outliersCount = mainTechs.loc[mainTechs <= 1].sum()
del mainTechs['']
plt.rcParams["figure.figsize"] = (20,10)
mainTechs = mainTechs.drop(mainTechs.loc[mainTechs.isin(outliers)].index)
mainTechs["Other"] = outliersCount
mainTechs.plot(kind='bar')
plt.xticks(rotation=90)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71]),
[Text(0, 0, 'Python'),
Text(1, 0, 'Java'),
Text(2, 0, 'PHP'),
Text(3, 0, 'JavaScript'),
Text(4, 0, 'C++'),
Text(5, 0, 'Javascript'),
Text(6, 0, 'C#'),
Text(7, 0, 'Scala'),
Text(8, 0, 'python'),
Text(9, 0, 'Kotlin'),
Text(10, 0, 'SQL'),
Text(11, 0, 'Swift'),
Text(12, 0, 'Go'),
Text(13, 0, 'Ruby'),
Text(14, 0, 'TypeScript'),
Text(15, 0, 'Typescript'),
Text(16, 0, '.NET'),
Text(17, 0, 'C'),
Text(18, 0, 'JS'),
Text(19, 0, '.net'),
Text(20, 0, 'Android'),
Text(21, 0, 'php'),
Text(22, 0, 'R'),
Text(23, 0, 'Angular'),
Text(24, 0, 'iOS'),
Text(25, 0, 'ABAP'),
Text(26, 0, 'javascript'),
Text(27, 0, 'React'),
Text(28, 0, 'java'),
Text(29, 0, 'Kubernetes'),
Text(30, 0, 'AWS'),
Text(31, 0, 'Embedded'),
Text(32, 0, 'Elixir'),
Text(33, 0, 'NodeJS'),
Text(34, 0, 'Spark'),
Text(35, 0, 'Sql'),
Text(36, 0, '.Net'),
Text(37, 0, 'Objective-C'),
Text(38, 0, 'Php'),
Text(39, 0, 'Terraform'),
Text(40, 0, 'Apache'),
Text(41, 0, 'SAP'),
Text(42, 0, 'Node.js'),
Text(43, 0, 'Golang'),
Text(44, 0, 'Linux'),
Text(45, 0, 'js'),
Text(46, 0, 'go'),
Text(47, 0, 'BI'),
Text(48, 0, 'Rails'),
Text(49, 0, 'on'),
Text(50, 0, 'GCP'),
Text(51, 0, 'Cloud'),
Text(52, 0, 'Azure'),
Text(53, 0, 'Js'),
Text(54, 0, 'c'),
Text(55, 0, '&'),
Text(56, 0, 'Pytorch'),
Text(57, 0, 'scala'),
Text(58, 0, 'Aws'),
Text(59, 0, 'Web'),
Text(60, 0, 'Qlik'),
Text(61, 0, 'Bash'),
Text(62, 0, 'Frontend'),
Text(63, 0, 'c++'),
Text(64, 0, 'core'),
Text(65, 0, 'NLP'),
Text(66, 0, 'Groovy'),
Text(67, 0, 'cloud'),
Text(68, 0, 'kotlin'),
Text(69, 0, 'yaml'),
Text(70, 0, 'QA'),
Text(71, 0, 'Other')])
#other technologies
otherTechs = data2["Other technologies"].str.lower().str.split(pat=r"[,\/\s]+",expand=True).stack().value_counts()
outliers = otherTechs.loc[otherTechs <= 1]
outliersCount = otherTechs.loc[otherTechs <= 1].sum()
del otherTechs['']
otherTechs = otherTechs.drop(otherTechs.loc[otherTechs.isin(outliers)].index)
otherTechs["Other"] = outliersCount
otherTechs.plot(kind='bar')
plt.xticks(rotation=90)
(array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]),
[Text(0, 0, 'docker'),
Text(1, 0, 'sql'),
Text(2, 0, 'python'),
Text(3, 0, 'typescript'),
Text(4, 0, 'javascript'),
Text(5, 0, 'aws'),
Text(6, 0, 'kubernetes'),
Text(7, 0, 'scala'),
Text(8, 0, 'java'),
Text(9, 0, 'google'),
Text(10, 0, 'cloud'),
Text(11, 0, 'kotlin'),
Text(12, 0, 'go'),
Text(13, 0, 'c'),
Text(14, 0, 'c++'),
Text(15, 0, '.net'),
Text(16, 0, 'azure'),
Text(17, 0, 'php'),
Text(18, 0, 'r'),
Text(19, 0, 'ruby'),
Text(20, 0, 'swift'),
Text(21, 0, 'sap'),
Text(22, 0, 'abap'),
Text(23, 0, 'rust'),
Text(24, 0, 'perl'),
Text(25, 0, 'clojure'),
Text(26, 0, 'spark'),
Text(27, 0, 'apache'),
Text(28, 0, 'react'),
Text(29, 0, 'matlab'),
Text(30, 0, 'pytorch'),
Text(31, 0, 'angular'),
Text(32, 0, 'ansible'),
Text(33, 0, 'terraform'),
Text(34, 0, 'elixir'),
Text(35, 0, 'dart'),
Text(36, 0, 'helm'),
Text(37, 0, 'tensorflow'),
Text(38, 0, 'hive'),
Text(39, 0, 'c#'),
Text(40, 0, 'oracle'),
Text(41, 0, 'sas'),
Text(42, 0, 'bash'),
Text(43, 0, 'hadoop'),
Text(44, 0, 'objective-c'),
Text(45, 0, 'linux'),
Text(46, 0, 'Other')])
Nyní si zobrazíme očištěná data. Grafy by měli být čitelnější než v explorativní analýze.
plot_graphs(data2)
count 896.000000 mean 31.930804 std 4.808246 min 22.000000 25% 28.000000 50% 32.000000 75% 35.000000 max 48.000000 Name: Age, dtype: float64
count 896 unique 3 top Male freq 775 Name: Gender, dtype: object
count 896 unique 98 top Berlin freq 494 Name: City, dtype: object
count 896 unique 84 top software engineer freq 303 Name: Position, dtype: object
count 896.000000 mean 8.345536 std 4.737489 min 0.000000 25% 5.000000 50% 8.000000 75% 11.000000 max 25.000000 Name: Total years of experience, dtype: float64
count 896.000000 mean 3.373996 std 2.691107 min 0.000000 25% 1.000000 50% 3.000000 75% 5.000000 max 14.000000 Name: Years of experience in Germany, dtype: float64
count 896 unique 15 top Senior freq 425 Name: Seniority level, dtype: object
count 896 unique 203 top Java freq 145 Name: Main Technology, dtype: object
count 896 unique 475 top Javascript / Typescript freq 37 Name: Other technologies, dtype: object
count 896.00000 mean 68957.67202 std 19088.11889 min 10001.00000 25% 60000.00000 50% 68500.00000 75% 79000.00000 max 154000.00000 Name: Yearly brutto, dtype: float64
count 896.000000 mean 10308.708092 std 21015.467597 min 0.000000 25% 0.000000 50% 0.000000 75% 8000.000000 max 99000.000000 Name: Yearly bonus + stocks in EUR, dtype: float64
count 896.000000 mean 64206.965547 std 17717.972456 min 10001.000000 25% 55000.000000 50% 65000.000000 75% 75000.000000 max 132000.000000 Name: Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country, dtype: float64
count 896.000000 mean 8656.660714 std 19530.577180 min 0.000000 25% 0.000000 50% 0.000000 75% 5000.000000 max 84000.000000 Name: Bonus and stocks in same country, dtype: float64
count 896.000000 mean 28.244420 std 2.434728 min 20.000000 25% 27.000000 50% 29.000000 75% 30.000000 max 38.000000 Name: Number of vacation days, dtype: float64
count 896 unique 10 top Full-time employee freq 872 Name: Employment status, dtype: object
count 896 unique 2 top Unlimited contract freq 854 Name: Сontract duration, dtype: object
count 896 unique 12 top English freq 751 Name: Main language at work, dtype: object
count 896 unique 5 top 1000+ freq 318 Name: Company size, dtype: object
count 896 unique 50 top Product freq 567 Name: Company type, dtype: object
count 896 unique 7 top No freq 851 Name: Job lost due covid, dtype: object
count 896.000000 mean 39.231027 std 2.675082 min 24.000000 25% 40.000000 50% 40.000000 75% 40.000000 max 40.000000 Name: Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week, dtype: float64
count 896.000000 mean 130.878672 std 306.620915 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 1600.000000 Name: Home office compensation, dtype: float64
V této sekci si připravíme dvě datové sady, jedna bude obsahovat pouze numerické atributy a druhé pouze kategorické. Pro převod kategorického atributu na numerický použijeme kódování one hot. Tato metoda vytvoří mnoho sloupců v závislosti na počtu kategorických atributů a počtu různých hodnot. Toto kódování je vhodné pro strojové učení a například v našem případě se odhad pomocí regrese zlepšil díky tomuto kódování z 85% na 94%.
V opačném případě, když budeme připravovat kategorickou sadu, tak musíme převést numerický atribut na kateogorie. Použijeme qcut, kde pomocí kvantilů rozdělíme na určitý počet košů. Mnoho atributů jako napřiklad bonusy mají většinu hodnot rovnou nule. Zde by nešlo použit mnoho košů. Naopak jsou atributy, které klidně můžeme rozdělit do více košů jako například věk. Proto začneme u každého atributz s větším počtem košů a budeme snižovat dokud se nám nepodaří převést (Každý koš musí mít vlastní hodnotu kvantilu).
nData = data2.copy(deep=True)
nData["Other technologies"] = data2["Other technologies"].str.split(pat=r"[,\/\s]+").str.len()
nData.head()
| Age | Gender | City | Position | Total years of experience | Years of experience in Germany | Seniority level | Main Technology | Other technologies | Yearly brutto | Yearly bonus + stocks in EUR | Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country | Bonus and stocks in same country | Number of vacation days | Employment status | Сontract duration | Main language at work | Company size | Company type | Job lost due covid | Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week | Home office compensation | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26.0 | Male | Munich | software engineer | 5.0 | 3.0 | Senior | TypeScript | 3 | 80000.0 | 5000.0 | 75000.0 | 10000.0 | 30.0 | Full-time employee | Unlimited contract | English | 51-100 | Product | No | 40.0 | 0.0 |
| 4 | 37.0 | Male | Berlin | backend developer | 17.0 | 6.0 | Senior | C# .NET | 4 | 62000.0 | 0.0 | 62000.0 | 0.0 | 29.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 40.0 | 0.0 |
| 5 | 32.0 | Male | Berlin | devops | 5.0 | 1.0 | Senior | AWS, GCP, Python,K8s | 6 | 76000.0 | 5000.0 | 76000.0 | 5000.0 | 30.0 | Full-time employee | Unlimited contract | English | 11-50 | Startup | No | 40.0 | 0.0 |
| 7 | 24.0 | Male | Berlin | frontend developer | 5.0 | 1.0 | Senior | Typescript | 2 | 65000.0 | 0.0 | 65000.0 | 0.0 | 27.0 | Full-time employee | Unlimited contract | English | 1000+ | Product | No | 40.0 | 600.0 |
| 8 | 29.0 | Male | Berlin | backend developer | 8.0 | 2.0 | Senior | PHP | 3 | 56000.0 | 0.0 | 55000.0 | 0.0 | 28.0 | Full-time employee | Unlimited contract | English | 101-1000 | Product | No | 30.0 | 0.0 |
attributes = ["Gender", "City", "Position", "Seniority level", "Main Technology", "Employment status", "Сontract duration", "Main language at work", "Company size", "Company type", "Job lost due covid", "Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week", "Home office compensation"]
one_hot = pd.get_dummies(nData)
nData = one_hot
nData=(nData-nData.min())/(nData.max()-nData.min())
#for attr in attributes:
# pass
# nData[attr] = pd.Categorical(nData[attr])
# nData[attr] = nData[attr].cat.codes
#one_hot.head()
nData.head()
| Age | Total years of experience | Years of experience in Germany | Other technologies | Yearly brutto | Yearly bonus + stocks in EUR | Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country | Bonus and stocks in same country | Number of vacation days | Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week | Home office compensation | Gender_Diverse | Gender_Female | Gender_Male | City_Amsterdam | City_Ansbach | City_Barcelona | City_Basel | City_Berlin | City_Bielefeld | City_Bodensee | City_Boeblingen | City_Bonn | City_Braunschweig | City_Brunswick | City_Brussels | City_Brussels | City_Bucharest | City_Cambridge | City_City in Russia | City_Cologne | City_Copenhagen | City_Cracovia | City_Cracow | City_Darmstadt | City_Den Haag | City_Dortmund | City_Dresden | City_Dublin | City_Duesseldorf | City_Dusseldorf | City_Dusseldurf | City_Düsseldorf | City_Düsseldorf | City_Eindhoven | City_Fr | City_France | City_Frankfurt | City_Friedrichshafen | City_Hamburg | City_Hannover | City_Heidelberg | City_Heidelberg | City_Helsinki | City_Ingolstadt | City_Jena | City_Karlsruhe | City_Kempten | City_Kiev | City_Konstanz | City_Krakow | City_Kyiv | City_Leipzig | City_Lisbon | City_London | City_Lübeck | City_Madrid | City_Mannheim | City_Mannheim | City_Marseille | City_Milan | City_Minsk | City_Moldova | City_Moscow | City_Munich | City_Münster | City_NJ, USA | City_Nuremberg | City_Nuremberg | City_Nürnberg | City_Paderborn | City_Paris | City_Prague | City_Prefer not to say | City_Regensburg | City_Riga, Latvia | City_Rome | City_Rosenheim | City_Saarbrücken | City_Saint-Petersburg | City_Salzburg | City_Samara | City_Sevilla | City_Siegen | City_Sofia | City_Stockholm | City_Stuttgart | City_Stuttgart | City_Tallinn | City_Tampere (Finland) | City_Tuttlingen | City_Ulm | City_Utrecht | City_Vienna | City_Warsaw | City_Warsaw, Poland | City_Wolfsburg | City_Wroclaw | City_Würzburg | City_Zurich | City_Zürich | City_warsaw | Position_ ai management | Position_ analyst | Position_ analytics engineer | Position_ architect | Position_ backend developer | Position_ bi analyst | Position_ bi consultant | Position_ bi consultant sap/ data engineer | Position_ bi developer / data engineer | Position_ bi it consultant | Position_ big data engineer | Position_ business analyst | Position_ c/c++/java developer | Position_ chief research officer | Position_ cloud architect | Position_ cloud engineer | Position_ computational linguist | Position_ computer vision researcher | Position_ consultant | Position_ cto | Position_ data analyst | Position_ data engineer | Position_ data science manager | Position_ data scientist | Position_ databengineer | Position_ dataops team lead | Position_ dba | Position_ designer (ui/ux) | Position_ devops | Position_ devops manager | Position_ director of engineering | Position_ embedded software engineer | Position_ engineering manager | Position_ firmware engineer | Position_ frontend developer | Position_ full-stack developer | Position_ fullstack developer | Position_ hardware engineer | Position_ head of bi | Position_ head of engineering | Position_ head of it | Position_ ios developer | Position_ it operations manager | Position_ it spezialist | Position_ lead developer | Position_ lead software developer | Position_ manager | Position_ marketing analyst | Position_ ml engineer | Position_ mobile developer | Position_ network engineer | Position_ presales engineer | Position_ product analyst | Position_ product management praktikant | Position_ product manager | Position_ program manager | Position_ project manager | Position_ qa engineer | Position_ qa lead | Position_ reporting engineer | Position_ researcher | Position_ robotics engineer | Position_ sap bw consultant | Position_ sap developer | Position_ scrum master | Position_ security engineer | Position_ software architect | Position_ software developer in test | Position_ software engineer | Position_ solution architect | Position_ solutions architect | Position_ sre | Position_ staff engineer | Position_ student, software developer, data engineer | Position_ support engineer | Position_ sw architect | Position_ team lead | Position_ team manager | Position_ tech lead / full-stack | Position_ tech leader | Position_ technical project manager | Position_ testautomation | Position_ testmanager | Position_ vp engineering | Seniority level_C-Level | Seniority level_C-level executive manager | Seniority level_Director | Seniority level_Entry level | Seniority level_Head | Seniority level_Intern | Seniority level_Junior | Seniority level_Lead | Seniority level_Manager | Seniority level_Middle | Seniority level_Principal | Seniority level_Senior | Seniority level_VP | Seniority level_Working Student | Seniority level_intern | Main Technology_.NET | Main Technology_.Net | Main Technology_.Net, Angular | Main Technology_.net | Main Technology_ABAP | Main Technology_AI | Main Technology_AWS | Main Technology_AWS, GCP, Python,K8s | Main Technology_Agile | Main Technology_Android | Main Technology_Android/Kotlin | Main Technology_Angular | Main Technology_Angular, TypeScript | Main Technology_Angular, Typescript | Main Technology_Apache Spark | Main Technology_Autonomous Driving | Main Technology_Aws Hadoop Postgre Typescript | Main Technology_Azure | Main Technology_Azure, SAP | Main Technology_BI, DWH, ETL/ELT | Main Technology_Bash | Main Technology_Blockchain | Main Technology_C | Main Technology_C# | Main Technology_C# .NET | Main Technology_C#, .net core | Main Technology_C#/.NET | Main Technology_C++ | Main Technology_C++, C# | Main Technology_C++, Java, Embedded C | Main Technology_C++/C# | Main Technology_C++/c | Main Technology_C, C++ | Main Technology_C/C++ | Main Technology_Clojure | Main Technology_Cloud | Main Technology_DevOps | Main Technology_Django, Flask, Plotly Dash | Main Technology_Elixir | Main Technology_Embedded | Main Technology_Embedded C++ | Main Technology_Erlang | Main Technology_FBD | Main Technology_Frontend | Main Technology_Frontend: react, node.js | Main Technology_GCP | Main Technology_Go | Main Technology_Go, PHP, SQL | Main Technology_Go/Python | Main Technology_Golang | Main Technology_Golang | Main Technology_Google Cloud Platform | Main Technology_Grails, Groovy | Main Technology_Haskell | Main Technology_JS | Main Technology_JS, Java | Main Technology_JS, WDIO | Main Technology_JavScript | Main Technology_Java | Main Technology_Java | Main Technology_Java & Distributed Systems Stuff | Main Technology_Java & PHP | Main Technology_Java / Scala | Main Technology_Java, .Net | Main Technology_Java, JS | Main Technology_Java, JavaScript | Main Technology_Java, Kotlin | Main Technology_Java, angular, Aws | Main Technology_Java, terraform | Main Technology_Java/C++ | Main Technology_Java/Groovy | Main Technology_Java/Kotlin | Main Technology_Java/Scala | Main Technology_JavaScript | Main Technology_JavaScript | Main Technology_JavaScript / TypeScript | Main Technology_JavaScript, TypeScript | Main Technology_JavaScript/TypeScript | Main Technology_JavaScript/Typescript | Main Technology_Javascript | Main Technology_Javascript | Main Technology_Javascript / Typescript | Main Technology_Javascript, Angular | Main Technology_Javascript/Typescript | Main Technology_Js | Main Technology_Js, reactJS | Main Technology_Kotlin | Main Technology_Kotlin | Main Technology_Kotlin, Java | Main Technology_Kuberenetes, Openstack | Main Technology_Kubernetes | Main Technology_Kubernetes, Terraform, GCP | Main Technology_Linux | Main Technology_Linux Kernel | Main Technology_Linux/UNIX, GIT, Virtualisation Platforms, *shell(s) scripts | Main Technology_ML | Main Technology_Magento | Main Technology_Ml/Python | Main Technology_NLP, Python | Main Technology_Node | Main Technology_Node.js | Main Technology_NodeJS | Main Technology_NodeJS, Typescript, AWS | Main Technology_NodeJS/TS | Main Technology_Nodejs | Main Technology_NodsJs | Main Technology_Objective-C | Main Technology_Oracle | Main Technology_PHP | Main Technology_PHP | Main Technology_PHP/MySQL | Main Technology_PL/SQL | Main Technology_PS, Sketch, React, CSS3 | Main Technology_Pegasystems platform | Main Technology_Perl | Main Technology_Php | Main Technology_Power BI | Main Technology_PowerShell | Main Technology_Pyrhon | Main Technology_Python | Main Technology_Python | Main Technology_Python (Django) | Main Technology_Python / JavaScript (React) | Main Technology_Python, Pytorch | Main Technology_Python, SQL | Main Technology_Python, Whole Apache Data Science Stack, AWS | Main Technology_Python, cloud computing | Main Technology_Python, statistics, SQL | Main Technology_Python/NLP | Main Technology_Python/SQL | Main Technology_QA | Main Technology_Qlik | Main Technology_Qlik BI Tool, SQL | Main Technology_Qml | Main Technology_R | Main Technology_React | Main Technology_React | Main Technology_React / JavaScript | Main Technology_Ruby | Main Technology_Ruby on Rails | Main Technology_SAP ABAP | Main Technology_SAP BW / ABAP | Main Technology_SAS,SQL,Python | Main Technology_SQL | Main Technology_SWIFT | Main Technology_Salesforce | Main Technology_Scala | Main Technology_Scala / Python | Main Technology_Scala, Apache Spark | Main Technology_Scala, React.js | Main Technology_Spark | Main Technology_Spark, Pytorch | Main Technology_Spring | Main Technology_Sql | Main Technology_Sql | Main Technology_Sql, BQ, tableau, gtm, ga | Main Technology_Swift | Main Technology_Swift, Objective-C | Main Technology_Swift, Objective-C, iOS | Main Technology_Swift, objective-c | Main Technology_Swift/Kotlin | Main Technology_Terraform | Main Technology_Terraform | Main Technology_Tricentis Tosca | Main Technology_TypeScript | Main Technology_TypeScript, JavaScript | Main Technology_TypeScript, Kotlin | Main Technology_TypeScript, React | Main Technology_TypeScript/Angular | Main Technology_Typescript | Main Technology_Typescript | Main Technology_Typescript, Web apps | Main Technology_VHDL | Main Technology_Web developer | Main Technology_c# | Main Technology_c++ | Main Technology_c/c++ | Main Technology_embedded | Main Technology_go | Main Technology_golang | Main Technology_iOS | Main Technology_java | Main Technology_java/scala/go/clouds/devops | Main Technology_javascript | Main Technology_jenkins bash | Main Technology_js | Main Technology_k8s | Main Technology_kotlin | Main Technology_networking, linux, automation, cloud | Main Technology_nothing | Main Technology_php | Main Technology_php, js, python, docker | Main Technology_pythin | Main Technology_python | Main Technology_python | Main Technology_scala | Main Technology_several | Main Technology_spark | Main Technology_sql | Main Technology_swift | Main Technology_typescript | Main Technology_yaml | Main Technology_С# | Employment status_Company Director | Employment status_Founder | Employment status_Full-time employee | Employment status_Full-time position, part-time position, & self-employed (freelancing, tutoring) | Employment status_Intern | Employment status_Part-time employee | Employment status_Self-employed (freelancer) | Employment status_Werkstudent | Employment status_Working Student | Employment status_full-time, but 32 hours per week (it was my request, I'm a student) | Сontract duration_Temporary contract | Сontract duration_Unlimited contract | Main language at work_Czech | Main language at work_English | Main language at work_English and German | Main language at work_French | Main language at work_German | Main language at work_Italian | Main language at work_Polish | Main language at work_Russian | Main language at work_Russian, English | Main language at work_Spanish | Main language at work_both | Main language at work_Русский | Company size_1000+ | Company size_101-1000 | Company size_11-50 | Company size_51-100 | Company size_up to 10 | Company type_Bank | Company type_Bank | Company type_Behörde | Company type_Bloody enterprise | Company type_Cloud | Company type_Construction | Company type_Consulting / Agency | Company type_Consulting and Product | Company type_Corporation | Company type_E-Commerce | Company type_E-commerce | Company type_Ecommerce | Company type_Energy | Company type_Enterprise | Company type_FAANG | Company type_Finance | Company type_Financial | Company type_Fintech | Company type_Full-time position in Education, part-time position in at a data startup | Company type_Game Company | Company type_Handel | Company type_IT Department of established business | Company type_Institute | Company type_Insurance | Company type_Market Research | Company type_Media | Company type_Non-tech retail | Company type_Old industry | Company type_Outsorce | Company type_Outsourse | Company type_Personal Ltd | Company type_Pharma | Company type_Product | Company type_Publisher | Company type_Publishing and Technology | Company type_Research | Company type_Research institute | Company type_Retail | Company type_SaaS | Company type_Science Institute | Company type_Semiconductor | Company type_Startup | Company type_Telecommunications | Company type_Transport & Logistics | Company type_University | Company type_Utilities | Company type_corporate incubator | Company type_e-commerce | Company type_eCommerce | Company type_service | Job lost due covid_Have been a freelancer at the beginning of year | Job lost due covid_Laid off for a bit | Job lost due covid_No | Job lost due covid_No, but there was a salary cut at 10% for 3 months and then at 5% for further 3 months | Job lost due covid_Yes | Job lost due covid_i didn't but will be looking for new one because of covid | Job lost due covid_yes but found a new one with better pay / perks | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.153846 | 0.20 | 0.214286 | 0.133333 | 0.486108 | 0.050505 | 0.532783 | 0.119048 | 0.555556 | 1.000 | 0.000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.576923 | 0.68 | 0.428571 | 0.200000 | 0.361107 | 0.000000 | 0.426225 | 0.000000 | 0.500000 | 1.000 | 0.000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.384615 | 0.20 | 0.071429 | 0.333333 | 0.458330 | 0.050505 | 0.540980 | 0.059524 | 0.555556 | 1.000 | 0.000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.076923 | 0.20 | 0.071429 | 0.066667 | 0.381940 | 0.000000 | 0.450815 | 0.000000 | 0.388889 | 1.000 | 0.375 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.269231 | 0.32 | 0.142857 | 0.133333 | 0.319440 | 0.000000 | 0.368847 | 0.000000 | 0.444444 | 0.375 | 0.000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
def svm(dataset, target):
brutto = dataset[target]
testRegrData = dataset.drop([target], axis=1)
bruttoPredict = LinearRegression()
bruttoPredict = bruttoPredict.fit(testRegrData.values, brutto)
#x = np.linspace(0, 175000, 175000)
#plt.plot(dataLearn[salaryColumnName], dataLearn['Age'], 'o')
accDif = 0.0
for i in range(len(testRegrData)):
p = bruttoPredict.predict([testRegrData.iloc[i]])
br = brutto.iloc[i]
accDif += np.abs(p - br)
accDif /= len(testRegrData)
print("Average delta: ", accDif)
sc = bruttoPredict.score(testRegrData, brutto)
print("Accuraci: ", sc)
svm(nData, "Yearly brutto")
Average delta: [0.02159963] Accuraci: 0.9342245214178913
cData = data2.copy(deep=True)
attributes = ["Age", "Total years of experience", "Years of experience in Germany", "Yearly brutto", "Yearly bonus + stocks in EUR", "Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country", "Bonus and stocks in same country", "Number of vacation days", "Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week", "Home office compensation"]
for attr in attributes:
#print("attr: ", attr)
#print("attr: ", cData[attr].describe())
#cData[attr] = pd.qcut(cData[attr], q=1) #TODO... increase q need normalize
for q in reversed(range(5)):
try:
cData[attr] = pd.qcut(cData[attr], q=q).astype(str)
print("attr: ", attr, " with q: ", q)
break
except:
continue
#pd.cut
#PD.CUT(column, bins=[ ],labels=[ ])
#pd.cut(df.Age,bins=[0,2,17,65,99],labels=[‘Toddler/Baby’,’Child’,’Adult’,’Elderly’])
#print()
#cData.head()
attr: Age with q: 4 attr: Total years of experience with q: 4 attr: Years of experience in Germany with q: 4 attr: Yearly brutto with q: 4 attr: Yearly bonus + stocks in EUR with q: 1 attr: Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country with q: 4 attr: Bonus and stocks in same country with q: 1 attr: Number of vacation days with q: 4 attr: Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week with q: 1 attr: Home office compensation with q: 1